ABSTRACT



Procedures used to compare the results from item response theory
as well as more traditional equating methods were described and critically
analyzed. The implications of the comparison of equipercentile, linear,
one-parameter (Rasch), and three-parameter methods for equating twelve
forms of each of the five tests of General Educational Development (GED)
were discussed.

The use of factor analyses to assess test dimensionality, examination
of equating curves, examination of item parameter estimates for extremes,
comparison of equating sample means and variances, and cross-validation
analyses were recommended for use by testing programs contemplating a
switch from traditional to item response theory equating. The
three-parameter equating method produced unacceptable equating results,
possibly because only 200 examinees per equating form were used. The
one-parameter (Rasch) method produced results which were as stable as
those for the traditional methods.



Comparison of Four Procedures for Equating the Tests
of General Educational Development

Most testing programs construct comparable forms of their tests
for a variety of reasons, such as maintaining test security and enabling
an individual to take a test more than once. Appropriate test score
equating procedures allow the forms to be used interchangeably without
serious question as to the comparability of common scale scores.
Appropriate test score equating enables us to say that, "A score of 50
means the same thing, whether it is earned on form 1 or form 2." While
equipercentile and linear equating methods (Angoff, 1971) are the most
widely accepted procedures for equating tests, item response theory
methods have recently been advocated as being more flexible (Lord, 1980;
Wright and Stone, 1980).

Many testing programs might want to convert from equipercentile
or linear equating to item response theory methods in order to gain this
additional flexibility. Although the conversion probably could be
justified only if it were not accompanied by a loss in equating accuracy
or precision, few systematic and objective procedures for comparing the
adequacy of results from two or more equating methods exist. Thus, the
practitioner has no prespecified set of systematic and objective
procedures to aid in choosing among competing methods.

The major purpose of the present study was to describe and
critically analyze the procedures used to compare the results from item
response theory as well as more traditional equating methods, as applied
to the equating of twelve forms of each of the five tests of General
Educational Development (GED). The procedures included the use of
cross-validation criteria, and factor analyses were used to assess the
degree of unidimensionality of the GED tests.

Another purpose was to present results from an empirical comparison
of equipercentile, linear, one-parameter (Rasch) logistic item response
theory, and three-parameter logistic item response theory methods for
equating the GED tests. The GED tests are achievement tests administered
to approximately one-half million candidates so that qualified
individuals can earn a high school equivalency diploma or certificate.
The implications of the findings for using item response theory methods
to equate the GED as well as other tests were discussed.

The test forms were administered to examinees in pairs (Design II;
Angoff, 1971). Although the procedures used to compare equating methods
can be used with any equating design, most of them require the existence
and use of a representative cross-validation group which has taken the
two equated forms. Also discussed was a procedure which can be used when
randomly equivalent groups of cross-validation examinees have taken the
two equated forms. The results from the equating methods were compared
on the raw score scale.

Equating Requirements 

Lord (1980) stated three requirements for methods of equating
two unidimensional tests. The first was referred to as equity. Assume
that a group of individuals of exactly the same ability has been
identified and that each individual has taken the same pair of equated
test forms. Also, assume that the scores on the second form have been
converted to the score scale of the first form using the equating results.
For equity to hold, the distribution of scores on the first form must be
identical to the distribution of converted scores on the second form.
This property must hold for a group of individuals at any given ability.
Note that equity implies that the distribution of scores on the two
forms, after conversion to a common score scale, must be identical for
any group of individuals, regardless of the distribution of ability in
the group.

The second property was referred to as invariance across groups.
For invariance to hold, the equating results must be the same, regardless
of the group of individuals used to equate the tests.

The third property is symmetry. That is, the equating should be
the same regardless of which test is equated to the scale of the other.
This rules out, for example, linear regression as an equating method.

Lord (1980) showed that equating of observed scores can be
expected to meet the equity and invariance requirements only when the
tests to be equated are either identical, in which case equating would
not be needed anyway, or perfectly reliable, a condition which will not
occur in practice. It seems reasonable, however, that equating methods
which come closest to meeting the equity and invariance requirements for
observed scores should be preferred.

A few empirical studies which examined the equity and invariance
requirements have been completed. These are reviewed in Kolen (1981).
In general, the results are inconclusive as to which equating methods are
to be preferred in most practical situations. The relative degree to
which the requirement of equity was met was the primary procedure used
to compare the equating methods in the present study.

Two assumptions are required when item response theory methods
are used to scale tests. First, the test must be unidimensional, or
alternatively, the items must exhibit the property of local independence
(Lord and Novick, 1968). Although no completely satisfactory procedures
for assessing test dimensionality exist, Lord (1980) suggested using
results from a factor analysis of the inter-item tetrachoric correlation
matrix. This procedure was used in the present study.

The second assumption is that the item response curves follow the
prespecified functional form. This assumption was addressed indirectly
in the present study by comparing the equating results from the item
response and traditional equating methods with respect to the equity
requirement.

The GED Tests

The GED tests are used to evaluate learning in everyday life,
enabling qualified individuals to earn high school equivalency diplomas
or certificates. Through the GED Testing Service of the American Council
on Education, the tests were administered to nearly one-half million
candidates in 1979. According to the GED Teacher's Manual (1979, p. 5),
"The GED tests are designed to measure, as nearly as possible, the major
and lasting outcomes and skills generally associated with four years of
regular high school instruction."

The GED consists of tests in each of five subject matter areas.
The Writing Skills test (80 items) contains items in spelling,
capitalization and punctuation, usage, sentence correction, and logic and
organization. The Social Studies test (60 items) contains U.S. history,
economics, geography, political science, and behavioral science items.
The Science test (60 items) contains items from biology, earth sciences,
chemistry, and physics. The fourth test, Reading Skills (40 items),
contains practical reading, general reading, prose literature, poetry,
and drama items. The Mathematics test (56 items) contains arithmetic,
geometry, and algebra items. The items on the Mathematics test, for the
most part, are story problems rather than straight computational items.

Blocks of items, where two or more items relate to a common
stimulus, are included in substantial numbers on the Social Studies,
Science, and Reading Skills tests. The common stimuli consist of
passages, graphs, charts, etc. Approximately one-third of the Social
Studies items, two-thirds of the Science items, and almost all of the
Reading Skills items are contained in common stimulus blocks. A more
detailed description of the tests is presented in the GED Teacher's
Manual (1979).

Because of the apparent content heterogeneity of the GED tests and the
inclusion of many common stimulus blocks, assessment of the degree of
unidimensionality of each test prior to the item response theory analyses
seemed very desirable.

Test Development

Twelve forms of each of the five GED tests were developed and
standardized by the Educational Testing Service (ETS) between December
[...] and January 1978. The tests were constructed in conjunction with
subject-matter advisory panels. In Spring 1977, the current GED tests
were standardized and equated using a carefully selected, stratified
random sample of high school students in the United States.
Kuder-Richardson 20 reliabilities of the forms ranged from .84 to .95
across the five tests. A variety of validity studies were also completed.
These studies and a more complete description of the GED development and
standardization are provided in ETS (1978).



1977 GED Equating Sample

The 1977 GED equating sample data were used in the present study;
for this reason they will be described in detail. The design for
collecting the data involved randomly sampling 294 school districts,
stratified by public-private, geographic region, and socio-economic
status, from among U.S. school districts. One high school was randomly
sampled from each district and 22 students were to be sampled from each
school for use in the equating portion of the studies.

The twelve GED test forms included in the present study will be
referred to as the anchor form and equating forms one through eleven.
Each examinee was administered the anchor form and one randomly selected
equating form of two of the GED tests. The order in which the forms were
administered was counterbalanced. These procedures resulted in 2227, 2278,
2267, 2269, and 2244 usable anchor form/equating form pairs for the
Writing Skills, Social Studies, Science, Reading Skills, and Mathematics
tests, respectively. Approximately 205 examinees were administered each
equating form of each of the tests.

Procedure

Principal axis factor analyses were completed to examine the
unidimensionality assumption. Tetrachoric correlation matrices for each
test were factored using the squared multiple correlation of an item with
all other items as communalities. The degree of unidimensionality
exhibited by each test was assessed through examination of the
eigenvalues.

Twelve forms of each of the five GED tests were separately equated
using equipercentile and linear equating methods as well as one-parameter
(Rasch) and three-parameter logistic estimated true score equivalents
equating methods. The item parameter estimates were examined for extreme
values, the equating curves were studied, and the equating results were
compared using a cross-validation sample. The data source and the
procedures used for the equating and cross-validation comprise the
remainder of this section.

Data Source

The 1977 GED equating sample was used as the data source. Note
that examinees were administered the anchor form and one of the equating
forms of a test. Whenever an examinee correctly answered either all or
none of the items on the anchor form or the equating form, the examinee's
data for that test were removed from the present study. This procedure
was followed because item response theory estimation procedures cannot
estimate the ability of individuals earning all or none correct on a form.
Between 190 and 218 examinees took each test and equating form combination.

Twenty examinee records were then randomly selected, stratified
by geographic region and socio-economic status, from each test and
equating form combination. These examinees comprised the cross-validation
sample and were not used in the equating portion of the study. The
remaining 170 to 198 examinees per test and equating form combination
will be referred to as the equating sample.
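
The screening and hold-out steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (a dict with hypothetical `anchor` and `equating` raw-score keys) and the function names are not from the study, and the hold-out here is a simple random draw, whereas the study stratified the draw by geographic region and socio-economic status.

```python
import random


def drop_extreme_scores(records, n_anchor_items, n_equating_items):
    """Remove examinees with all or none correct on either form.

    Each record is assumed to be a dict with hypothetical keys
    'anchor' and 'equating' holding raw scores.
    """
    return [
        r for r in records
        if 0 < r["anchor"] < n_anchor_items
        and 0 < r["equating"] < n_equating_items
    ]


def split_cross_validation(records, n_holdout=20, seed=0):
    """Return (equating_sample, cross_validation_sample).

    Simple random hold-out; the stratification used in the study
    is not reproduced in this sketch.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[n_holdout:], shuffled[:n_holdout]
```

With 190 to 218 screened examinees per combination and a hold-out of 20, a split of this kind leaves the 170 to 198 equating sample examinees reported above.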

Equating Methodology

Four equating methods were used to equate the GED forms using
the equating sample data. Linear and equipercentile methods are discussed
together and referred to as traditional equating methods. One-parameter
logistic (Rasch) and three-parameter logistic methods are discussed
together and referred to as item response theory equating methods.



Traditional equating. The anchor and equating form pairs of each
test were equated separately. For example, the 198 equating sample
examinees taking both the anchor form and equating form one of the Writing
Skills test were used to equate form one to the anchor form raw score
scale. Method IA-1 described by Angoff (1971) was used for linear
equating and Method IA-2 was used for equipercentile equating.

For linear equating, whenever the anchor form equivalent of an
equating form one score was above the highest possible score on the anchor
form, it was fixed at the highest possible anchor form score. A similar
procedure was followed whenever the anchor form equivalent was below a
score of zero. For equipercentile equating, linear interpolation, as
opposed to smoothing, was used when necessary. Identical procedures were
followed in the equating of equating forms one through eleven scores to
the anchor form raw score scale for each of the five GED tests.
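
The truncation rule for linear equating can be illustrated with a short Python sketch. It assumes the simplest mean and standard deviation form of linear equating (equal standardized deviation scores on the two forms); Angoff's (1971) formulas for the counterbalanced design are more detailed and are not reproduced here. The function name and arguments are hypothetical.

```python
from statistics import mean, pstdev


def linear_equating_table(anchor_scores, equating_scores,
                          max_anchor, max_equating):
    """Map each integer equating form score to an anchor form equivalent.

    Assumes a basic mean/sigma linear conversion, then fixes equivalents
    above the highest possible anchor score at that score and
    equivalents below zero at zero.
    """
    mx, sx = mean(anchor_scores), pstdev(anchor_scores)
    my, sy = mean(equating_scores), pstdev(equating_scores)
    slope = sx / sy
    table = {}
    for y in range(max_equating + 1):
        x = mx + slope * (y - my)
        table[y] = min(max(x, 0.0), float(max_anchor))
    return table
```

Because the conversion is a straight line, the truncation only affects scores near the ends of the scale, where few examinees fall.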

Item response theory methods. The first step in item response
theory equating was to estimate the item and ability parameters for the
one-parameter and three-parameter logistic models. The LOGIST computer
program of Wood, Wingersky, and Lord (1976) was used for this estimation.

The anchor form and equating form one of the Writing Skills test
will be used as an example. The item parameters for the 80 anchor form
items and the ability parameters for the 2,227 equating sample examinees
who took the Writing Skills test were estimated using LOGIST. The
ability parameters for the 198 examinees who also were administered
equating form one were then fixed. These fixed ability estimates along
with the item responses of these 198 examinees were then entered into
LOGIST. Because of small sample sizes, the "pseudo-chance" parameters for
the equating forms were fixed at the modal anchor form value of the
corresponding anchor test.¹ This produced equating form one item
parameters on the same scale as the anchor form estimates. Similar
procedures were followed for equating forms two through eleven of the
Writing Skills test. These procedures were also followed for the other
four GED tests using both the one-parameter and three-parameter logistic
item response theory models.²

The next stage in the equating process was to derive anchor form
score equivalents of equating form scores using estimated true score
equating (Lord, 1980). The estimated true score of an examinee with a
given estimated ability is equal to the sum, over items, of the estimated
probability of correctly answering each item. Using non-linear estimation
procedures, anchor form estimated true score equivalents of equating form
one through eleven integer scores were calculated. The procedure was
followed for the five GED tests using the one-parameter and
three-parameter logistic models.
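
A minimal Python sketch of estimated true score equating under the three-parameter logistic model follows. It is an illustration only, not the procedure as implemented with LOGIST: the scaling constant D = 1.7, the bisection search, the ability bounds, and all function names are assumptions. Each item is a tuple (a, b, c) of discrimination, difficulty, and "pseudo-chance" estimates; setting a = 1 and c = 0 gives a one-parameter form.

```python
import math

D = 1.7  # logistic scaling constant; assumed here, not stated in the text


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def true_score(theta, items):
    """Estimated true score: sum over items of P(correct | theta)."""
    return sum(p_correct(theta, a, b, c) for a, b, c in items)


def theta_for_true_score(score, items, lo=-10.0, hi=10.0, tol=1e-8):
    """Find theta whose true score equals `score` by bisection.

    Returns None when the score cannot be reached on [lo, hi], for
    example at or below the sum of the pseudo-chance parameters.
    """
    if not (true_score(lo, items) < score < true_score(hi, items)):
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if true_score(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def equate_true_scores(equating_items, anchor_items):
    """Anchor form true score equivalent of each integer equating score."""
    table = {}
    for y in range(len(equating_items) + 1):
        theta = theta_for_true_score(y, equating_items)
        table[y] = None if theta is None else true_score(theta, anchor_items)
    return table
```

Integer scores that no ability value can produce come back as `None` in this sketch, so they have to be filled in by some separate rule.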

Note that estimated true scores below the estimated "pseudo-chance"
level of a test (the sum of the item "pseudo-chance" parameter estimates)
are undefined for the three-parameter logistic model. Scores of zero on
any pair of forms were arbitrarily considered to be equivalent; "missing"
equivalents below the "pseudo-chance" level were obtained via linear
interpolation. Lord (1980, pp. 210-211) addressed this problem in a
slightly different manner.

¹ The modal "pseudo-chance" level parameters were 0.150, 0.165,
0.140, 0.200, and 0.150 for the GED Writing Skills, Social Studies,
Science, Reading Skills, and Mathematics tests, respectively.

² An attempt was made to simultaneously estimate all of the item
parameters on the 12 forms of the Writing Skills test using the
three-parameter logistic model. LOGIST failed to converge, however. Lord
(1980, pp. 209-210) suggested a modification to the LOGIST program which
could be expected to solve the convergence problem. The simultaneous
procedure also could be expected to produce more precise item parameter
estimates because the ability estimates would reflect performance on all
items taken rather than the anchor form items only. The authors were
unaware of Lord's modification at the time this study was conducted.



Cross-Validation Methodology

The twenty randomly selected examinees from each test and equating
form combination comprised the cross-validation sample. The anchor form
and equating form one scores on the Writing Skills test will be used as
an example in the development of the cross-validation procedures.

The twenty cross-validation sample examinee scores on equating
form one were converted to the anchor form score scale using the linear
method equating table. Let $X_i$ represent the score of cross-validation
sample examinee i on the anchor form of the Writing Skills test. Let
$Y_i$ represent the score of the same examinee on equating form one and
let $Y_i'$ represent this equating form one score converted to the anchor
form score scale using the linear method equating table for converting
equating form one scores to the anchor form scale. The difference between
the anchor form score ($X_i$) and the converted equating form one score
($Y_i'$) for an examinee,

$$D_i = X_i - Y_i' \tag{1}$$

was used as the basis for forming a cross-validation summary statistic.

The $D_i$ quantities could be squared and then averaged over the
twenty appropriate cross-validation examinees. However, this quantity
can be broken down into further components:

$$\frac{1}{n}\sum_{i=1}^{n} D_i^2 = \bar{D}^2 + \frac{1}{n}\sum_{i=1}^{n}\left(D_i - \bar{D}\right)^2 \tag{2}$$

In this equation n is the number of cross-validation examinees (20 for
the present study). The first quantity to the right of the equal sign
is the mean value of D, squared. This quantity represents the squared
mean difference between anchor form and converted equating form one
scores. It will be referred to as the measure of equating bias. The
second quantity on the right represents the variance of the differences
between anchor form and converted equating form one scores and will be
referred to as the measure of equating imprecision.
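
The split of the mean squared difference into a bias part and an imprecision part can be checked numerically with a short Python sketch. The function name is hypothetical, and the variance is assumed to use a divisor of n (not n - 1), which is what makes the two parts sum exactly to the mean squared difference.

```python
def bias_and_imprecision(anchor, converted):
    """Return (bias, imprecision, mean_squared_difference).

    anchor    : anchor form scores X_i
    converted : equating form scores converted to the anchor scale, Y_i'
    The variance uses a divisor of n (an assumption of this sketch).
    """
    n = len(anchor)
    d = [x - y for x, y in zip(anchor, converted)]
    mean_d = sum(d) / n
    bias = mean_d ** 2
    imprecision = sum((di - mean_d) ** 2 for di in d) / n
    msd = sum(di ** 2 for di in d) / n
    return bias, imprecision, msd


# Four hypothetical examinees: differences are 1, 0, -1, 2.
bias, imprecision, msd = bias_and_imprecision([10, 12, 14, 16],
                                              [9, 12, 15, 14])
print(bias, imprecision, msd)  # 0.25 1.25 1.5
```

In this made-up example the bias index (0.25) and the imprecision index (1.25) add up to the mean squared difference (1.5), as the decomposition requires.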

Equating bias and imprecision indices were computed separately
for each test and equating form combination. A one-way repeated measures
analysis of variance was completed for each test and index combination.
Form (eleven levels: equating forms one through eleven) was treated as the
random "subjects" factor and equating method (four levels) as the fixed
repeated measures factor. Tukey post hoc paired comparisons were also
used.

Kolen (1981) developed a cross-validation index which is
appropriate when randomly equivalent groups take the forms to be equated.
The index can also be applied when each examinee takes both tests, such
as in the present study. This index will be referred to as the percentile
comparison index.

The percentile comparison index is a measure of the dissimilarity
between distributions of anchor form scores and converted scores on an
equating form. To compute this index, the cross-validation distributions
were tabulated and percentile ranks calculated separately for the anchor
form and converted equating form one scores. The percentile comparison
index was formed by finding the difference between each observed anchor
form score and the converted equating form one score with an identical
percentile rank in the converted equating form one distribution. This
difference was then weighted by the number of individuals earning the
anchor form score and summed over the observed anchor form scores. The
equation is,

$$\sum_{i} f_i \left(X_i - Y_i''\right)^2 \tag{3}$$



In the equation, $X_i$ represents an anchor form integer score, $Y_i''$
the equating form one score with the same percentile rank, and $f_i$ the
number of examinees that earned $X_i$. Like the bias and imprecision
indices, smaller values indicate better performance for the equating
method. Repeated measures analyses of variance were completed for the
percentile comparison index in a manner similar to those completed for
the bias and imprecision measures.



Results

The eigenvalues and percentages of variance accounted for by each
of the first twelve factors in the factor analyses are presented in
Table 1. The ratios of the first to second eigenvalue, a rough index of
unidimensionality, were 7.4, 10.2, 7.4, 9.3, and 4.7 for the Writing
Skills, Social Studies, Science, Reading Skills, and Mathematics tests,
respectively. Only the Mathematics test approached having a substantial
second factor. Overall, the factor analyses suggested that all of the
tests, except possibly Mathematics, were reasonably unidimensional.

Insert Table 1 About
Here



The item parameter estimates were examined for irregularities.
Extreme three-parameter model difficulty estimates (absolute value above
3.5) were discovered for a number of items on the equating forms of the
Writing Skills, Social Studies, and Science tests. Very few were found
on the Reading Skills and Mathematics tests. These are reflected in the
standard deviations of the three-parameter item difficulty parameter
estimates shown in Table 2.

Insert Table 2 About
Here

Note that extreme three-parameter discrimination estimates were not
reflected in the standard deviations in Table 2. However, the
discrimination estimates were constrained between 0 and 2 by LOGIST.
Also, the discrimination index may not be on an equal-interval scale;
differences in parameter estimates near zero may reflect larger
differences in discrimination than differences at other points. The fact
that very low discrimination estimates tended to accompany the extreme
difficulty estimates supports this notion. The one-parameter model
produced no extreme difficulty estimates across all of the tests and
forms. It appears that problems were encountered in estimating the item
parameters in the three-parameter model, especially for the longer tests.

The equating relationships were also examined. Figure 1 presents
the equating relationships between the anchor form and equating form one
of the Reading Skills test. This pair of forms was chosen because
equating form one contained no extreme parameter estimates and the
relationships were fairly representative.



Insert Figure 1 About
Here

Note that the anchor form was generally less difficult than equating
form one. This was true for most forms studied. The mean anchor form
raw scores were generally from one to three points higher than their
equating form counterparts across all tests. This result is illustrated
in Figure 1.

In the figure, the three-parameter method produced the smallest
anchor form equivalents of lower equating form one scores. It also
produced the greatest anchor form equivalents of the higher equating form
one scores. This result held, for the most part, across all of the forms
of the GED tests studied.

For lower equating form one scores, the one-parameter method curve
tended to be lower than the equipercentile curve, which tended to be lower
than the linear curve. The reverse appeared to hold for the higher scores.
This relationship was present for most of the test forms studied.

The means of the bias, imprecision, and percentile comparison
cross-validation statistics are presented in Table 3. Tukey critical
differences are also presented.



Insert Table 3 About
Here

Due to the differences in test lengths, none of these indices should be
compared across tests.

The three-parameter method produced the largest bias index for
each test. The Reading Skills and Social Studies examinations were the
only tests for which the equating method F-test in the analysis of
variance surpassed the .05 critical value. However, the differences in
bias statistics among methods were not appreciably large.

The imprecision measure showed more substantial differences. The
Tukey comparisons indicated that the three-parameter method was more
imprecise than the other methods for all tests. No evidence of consistent
differences among the other methods was found.

The percentile comparison measure for the three-parameter method
was largest for each GED test. While none of the paired comparisons
surpassed the Tukey critical difference, Scheffé comparisons of the
three-parameter method with the mean of the equipercentile, linear, and
one-parameter methods surpassed the .05 critical value (df = 3, 30) for
Writing Skills (F = 3.52), Social Studies (F = 5.72), Reading Skills
(F = 4.58), and Mathematics (F = 6.64).

Friedman statistics (Conover, 1971) were also computed for each of
the cross-validation indices and for each GED test because the assumption
of normality was probably violated in the analyses. Forms were treated
as blocks and equating methods as treatments in the analyses. The Friedman
statistics are not reported here because the results were essentially
equivalent to the analyses of variance.

Discussion

Factor Analyses

The factor analyses suggested that each of the GED tests, with
the possible exception of the Mathematics test, was reasonably
unidimensional. This was suggested despite the fact that the GED tests
are, in general, content heterogeneous and contain many items which are
presented as part of a common stimulus block. Since consistent differences
between Mathematics test equating results and those for other tests were
not discovered, the results of dimensionality differences, if any
differences did exist, were not detected by the procedures used.
Unidimensionality may be crucial for item response theory scaling. For
this reason, its assessment should be a routine aspect of any comparison
among equating methods that includes item response theory methods.

Examination of Item Parameter Estimates

The three-parameter estimation procedure produced a number of
extreme parameter estimates, suggesting that difficulties were encountered
in parameter estimation. Since these difficulties can be expected to
affect equating results, examination of item parameter estimates for
extremes should be included in equating method comparisons. The existence
of extreme unconstrained parameter estimates for item response theory
models requiring constraints on other parameters to achieve convergence
suggests difficulties in item parameter estimation.

Graphing of Equating Curves

The graphing and examination of equating curves suggested that a
relationship existed among the equating curves. The relationships will
be discussed later. Since relationships discovered can have consequences
in practice, a graphing and examination of the equating curves can be
very useful when comparing results from equating methods.

Some Factors Affecting the Cross-Validation

The findings from the cross-validation analyses necessarily
depended on all factors affecting the adequacy of the equating. Equating
group sample size and the specific equating methods used are such factors.
For example, different findings might have occurred had smoothing instead
of linear interpolation been used in equipercentile equating or had Lord's
(1980, pp. 209-210) suggested modification to LOGIST for extensive
simultaneous estimation been used instead of the simultaneous procedure
used here. Additionally, larger cross-validation samples can be expected
to increase the chances of detecting differences among equating methods
when differences do exist.

The cross-validation indices were designed to reflect differences
between cross-validation anchor form and converted equating form (i.e.,
equated to the anchor form raw score scale) distributions for examinees
taking both forms. Under equity considerations, the two distributions
should be identical, apart from sampling error.

Bias Index . - 

A bias index was calculated for each of- the 5S-fc«st/equating ^rm 
comblHationa. This index -wats the squared difference Tietween the maan 
anchor form a"kd mean converted equating form scores and reflects both 
equating errof and erjyj^ Iji sampling examlifees for the cross-Validation 
group. The three-parameter equating method. tended to produce the largest 
bias indices in the cross-validation. 

The larger bias indices for the three-parameter method may have
reflected a combination of sampling error and imprecision rather than
bias as such, where bias is defined as the mean difference between
anchor and converted equating form scores for an infinitely large
cross-validation group. Consider the following not too unlikely scenario
given a cross-validation sample of size 20. Suppose that there existed a
single examinee in the cross-validation sample with a very low score on one
equating form. When converted to the anchor form scale using equating
curves like those in Figure 1, a lower converted score would be produced
for the three-parameter method than for the other equating methods studied.
If the sample happened to contain no very high scores to compensate for
the very low one, then the bias index for the three-parameter method
would be higher than the index for the other methods. Other likely
scenarios with the same implications for the bias index are possible with
small cross-validation samples. The chances of this phenomenon occurring
should diminish as cross-validation sample size increases.

The mean difference between anchor and converted equating form
scores, over all 220 cross-validation examinees taking the anchor and
equating forms of each test, was calculated to investigate this hypothesis.
Although not presented here, the differences in means were smaller than
might have been expected from the bias indices shown in Table 3. In fact,
for the Mathematics test the mean difference for the three-parameter
method was closer to zero than the mean difference for the other equating
methods. Hence, the bias index may have been strongly influenced by the
combination of imprecision and error in sampling the cross-validation
group examinees.
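
The way imprecision can leak into the bias index can be shown with a short worked expression. As a sketch, assume the $n$ cross-validation differences $d_i$ (anchor form score minus converted equating form score) are independent draws with mean $\mu_d$ and variance $\sigma_d^2$; these symbols are introduced here for illustration and do not appear in the original analysis. The expected value of the squared mean difference is then

```latex
E\left[\bar{d}^{\,2}\right] = \mu_d^2 + \frac{\sigma_d^2}{n}
```

With $n = 20$, a hypothetical $\sigma_d^2 = 25$ would contribute $25/20 = 1.25$ to the expected index even when $\mu_d = 0$, so a method with more variable differences can show a larger bias index without any true bias.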

The meaning of the larger bias indices for the three-parameter
method is unclear. The bias index should be carefully interpreted when
small cross-validation sample sizes are used.

Imprecision Index 

A separate imprecision index was calculated for each of the 55
test/equating form combinations. This index represents the variance of
the difference between cross-validation anchor form and converted equating
form scores. The three-parameter equating method consistently produced
the largest values of the imprecision index.
The imprecision index can be decomposed such that

$$ s_{x-y'}^2 = s_x^2 + s_{y'}^2 - 2\, r_{xy'}\, s_x\, s_{y'} . \tag{4} $$

The quantities $s_x$ and $s_{y'}$ are the observed standard deviations, for
examinees taking one of the equating forms, of the anchor form and
converted equating form scores, respectively. The correlation between
anchor form and converted equating form cross-validation scores is
represented by $r_{xy'}$.
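
The decomposition can be checked numerically. The sketch below assumes the variance is computed with divisor n (the source does not say whether n or n - 1 was used); the function names and scores are hypothetical.

```python
import numpy as np

def imprecision_index(anchor, converted):
    """Variance of the differences between anchor form scores and
    converted equating form scores for the same examinees."""
    diff = np.asarray(anchor, dtype=float) - np.asarray(converted, dtype=float)
    return diff.var()  # divisor n; an assumption, not stated in the source

def decomposed_imprecision(anchor, converted):
    """Same quantity rebuilt from the two standard deviations and
    the correlation between the two sets of scores."""
    x = np.asarray(anchor, dtype=float)
    y = np.asarray(converted, dtype=float)
    s_x, s_y = x.std(), y.std()
    r_xy = np.corrcoef(x, y)[0, 1]
    return s_x**2 + s_y**2 - 2 * r_xy * s_x * s_y

# Hypothetical scores for five examinees who took both forms.
anchor_scores = [30, 25, 28, 35, 22]
converted_scores = [29.0, 26.5, 27.0, 33.5, 21.0]
direct = imprecision_index(anchor_scores, converted_scores)
rebuilt = decomposed_imprecision(anchor_scores, converted_scores)
print(direct, rebuilt)
```

The two values agree up to floating-point error, which makes it easy to see how a change in the standard deviation of the converted scores, or in the correlation, moves the index.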

When comparing the imprecision index from one equating method to
another, $s_x$ will remain constant. The quantities $s_{y'}$ and $r_{xy'}$
can vary, however. For most forms studied, $s_{y'}$ was largest for the
three-parameter method. The quantity $r_{xy'}$ was not consistently larger
or smaller for any of the methods. It appears that the larger imprecision
indices for the three-parameter method resulted from comparatively larger
variances of converted equating form scores for this method than for any
other. Inspection of Figure 1 suggests how this occurred. Low equating form
scores were converted to lower anchor form scores for the three-parameter
method than for the other equating methods. High equating form scores were
converted to higher anchor form scores for the three-parameter method than
for the other equating methods. This would be expected to lead to a larger
variance of converted equating form scores for the three-parameter method
and, therefore, a larger imprecision index.

A problem with the use of the imprecision index becomes apparent
when equation (4) is considered. It can be shown that the use of linear
regression, instead of linear equating, would be expected to lead to a
smaller value of the imprecision index. Although linear regression does not
qualify as an equating method, since it does not meet the symmetry
requirement, an equating method "could look better than it really was" if it
produced too small a variance of converted equating form scores. The
percentile comparison index was included as a potential procedure for
circumventing this problem.
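
The claim about linear regression can be sketched with the usual variance-of-a-difference identity, $s_x^2 + s_{y'}^2 - 2 r_{xy'} s_x s_{y'}$. The derivation below assumes the idealized case in which the conversion is evaluated on the same sample used to derive it and the correlation $r$ between forms is positive; in an independent cross-validation sample the equalities hold only approximately.

```latex
\begin{aligned}
\text{Linear equating: } & s_{y'} = s_x,\; r_{xy'} = r
  &&\Rightarrow\; s_{x-y'}^2 = 2 s_x^2 (1 - r) \\
\text{Linear regression: } & s_{y'} = r\, s_x,\; r_{xy'} = r
  &&\Rightarrow\; s_{x-y'}^2 = s_x^2 (1 - r^2) = s_x^2 (1 - r)(1 + r)
\end{aligned}
```

Because $1 + r \le 2$, the regression value never exceeds the linear equating value, and it is strictly smaller whenever $r < 1$. The reduction comes entirely from shrinking $s_{y'}$, which is the sense in which a conversion with too small a variance of converted scores is rewarded by the index.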



Percentile Comparison Index

The percentile comparison index was formed by calculating the
difference between each integer anchor form score and the converted equating
form score having an identical percentile rank in the cross-validation
sample. Each difference was squared and weighted by the number of
cross-validation examinees earning the corresponding anchor form score. The
mean of these squared differences comprised the percentile comparison
index. A separate index was calculated for each of the 55 test/equating
form combinations.
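
One way to turn this description into a computation is sketched below. The source does not specify how percentile ranks were defined or how converted scores were interpolated, so two assumptions are made here: the percentile rank of an anchor score is the proportion of examinees below it plus half the proportion at it, and the converted score at a given rank is found by linear interpolation between ordered converted scores placed at ranks (k - 0.5)/m. The function name and data are hypothetical.

```python
import numpy as np

def percentile_comparison_index(anchor, converted):
    """Frequency-weighted mean squared difference between each earned
    anchor form score and the converted score at the same percentile rank."""
    anchor = np.asarray(anchor, dtype=float)
    sorted_conv = np.sort(np.asarray(converted, dtype=float))
    n, m = anchor.size, sorted_conv.size
    # Assumed rank assigned to the k-th ordered converted score.
    positions = (np.arange(1, m + 1) - 0.5) / m

    scores, counts = np.unique(anchor, return_counts=True)
    total = 0.0
    for score, count in zip(scores, counts):
        below = np.sum(anchor < score)
        rank = (below + 0.5 * count) / n            # assumed mid-percentile rank
        matched = np.interp(rank, positions, sorted_conv)
        total += count * (score - matched) ** 2     # weight by frequency
    return total / n

# Hypothetical example: converted scores sit one point above anchor scores.
anchor_scores = [1, 2, 3, 4, 5]
converted_scores = [2, 3, 4, 5, 6]
print(percentile_comparison_index(anchor_scores, converted_scores))
```

Because only the two marginal distributions are used, the anchor and converted scores do not have to come from the same examinees, which matches the use of the index with randomly equivalent groups.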

It seems that this index will be larger whenever the variances of
the anchor form and converted equating form cross-validation scores differ.
(No proof can be offered as this is a fairly complicated index.) If this
is true, then the problem of an equating method "looking better than it
really was," which was mentioned in connection with the imprecision index,
would be eliminated with the percentile comparison index.

As with the imprecision index, the percentile comparison index
tended to be largest for the three-parameter equating method. In both
cases, the larger variances of the converted equating form distributions
for this method were probably responsible for the larger values of the
indices.

Critique of Cross-Validation Procedures

The cross-validation suggested that the three-parameter item
response theory method produced inferior equating results. This was
probably a result of the three-parameter method producing overly variable
converted equating form score distributions. Figure 1 illustrated how this
probably occurred.

If the variances of the converted equating form scores of equating
sample examinees (as opposed to cross-validation sample examinees) had
been calculated, the variances for the three-parameter method probably
would have been largest. (These variances are currently being examined
by the authors.) Therefore, computation of means and variances for equating
sample examinees has the capacity to provide useful information when
completed prior to a cross-validation study.

The percentile comparison index was the only cross-validation
index considered that is appropriate when randomly equivalent groups take
the anchor and equating forms in the cross-validation. The bias and
imprecision indices require that the cross-validation examinees take both
forms. Also, the percentile comparison index is probably not biased in
favor of a procedure like linear regression. However, it appears that
the percentile comparison index is less sensitive to differences among
equating method results, since it less consistently identified differences
than did the imprecision index. The percentile comparison index, along with
the bias and imprecision indices, should be used in equating method
cross-validation studies.

As mentioned previously, the bias index can be affected by equating
imprecision for small samples. Since the mean (bias) and variance
(imprecision) of a distribution are generally not independent quantities, it
may be beneficial to consider only a composite index of the two.

Note that raw scores were used for all of the cross-validation
comparisons. If the raw scores were linearly converted to standard scores
so that each GED test had the same mean and standard deviation, then the
indices could be compared across tests. We are currently attempting to
use standard scores.
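
A linear conversion of this kind can be sketched as follows. The target mean of 50 and standard deviation of 10, the choice of the anchor form distribution as the reference, and all names and values are assumptions made for illustration; the source does not state how the standard score scale would be defined.

```python
import numpy as np

def to_standard_scale(scores, ref_mean, ref_sd, new_mean=50.0, new_sd=10.0):
    """Linearly convert raw scores so that the reference distribution
    has mean `new_mean` and standard deviation `new_sd`."""
    scores = np.asarray(scores, dtype=float)
    return new_mean + new_sd * (scores - ref_mean) / ref_sd

# Hypothetical raw scores for examinees who took both forms of one test.
anchor = np.array([30, 25, 28, 35, 22], dtype=float)
converted = np.array([29.0, 26.5, 27.0, 33.5, 21.0])

# The same reference (here, the anchor form distribution) is applied to both.
ref_mean, ref_sd = anchor.mean(), anchor.std()
anchor_std = to_standard_scale(anchor, ref_mean, ref_sd)
converted_std = to_standard_scale(converted, ref_mean, ref_sd)

raw_imprecision = np.var(anchor - converted)
std_imprecision = np.var(anchor_std - converted_std)
print(raw_imprecision, std_imprecision)
```

Under such a conversion the squared-difference and variance indices are all rescaled by the factor (new_sd / ref_sd) squared, so tests with different raw score spreads are placed on a common footing.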



Implications for Test Equating

The problems encountered with the three-parameter equating method
were at least partially a result of small sample sizes and the methods
used to estimate the parameters. In review, the ability parameters and
anchor form item parameters were estimated using over 2,000 examinee
records for each test. The ability parameters were then fixed, as were
the lower asymptote parameters. The difficulty and discrimination
parameters were estimated, separately, for each of the equating forms using
from 170 to 198 records for examinees taking the form.

The small sample sizes for the equating forms were probably
responsible for the extreme parameter estimates discovered. Additionally,
it was noticed that many examinees' scores on the anchor form and the
other form of a test taken were very different. A screening procedure
for removing examinees whose scores on the two forms were very different
might have improved the situation. Lord's (1980, pp. 209-210) modification
of LOGIST for extensive simultaneous estimation might also have improved
the estimation.

A consistent relationship between the equating curves for the
three-parameter method and other methods was found. An individual with
a lower equating form one score would be penalized if the three-parameter
equating curve, as opposed to the curve for any other method, was used
to convert the score to the anchor form scale. An individual with a
higher equating form one score would have benefitted had the three-parameter
equating curve, as opposed to any of the other curves, been used to
transform the score to the anchor form scale.

This might have resulted from the problem encountered in the
parameter estimation. However, in another study in which parameter
estimation did not appear to be problematic, Kolen (1979) discovered a
similar relationship when scores on a form were converted to the scale of
another form that was slightly less difficult. Kolen (1981) hypothesized
that this resulted from the condensing of the estimated true score scale
with the three-parameter model. That is, estimated true scores below the
"pseudo-chance" level of the test (the sum of the lower asymptote
parameters) do not exist. Hence, the estimated true score scale is a
condensed version of the raw score scale. This hypothesis takes on greater
weight when it is realized that the equating curves seem to pass at, or
very near, the joint raw score means of the equated forms. The condensing
problem could be avoided if estimated observed score equating (Lord, 1980)
were used instead of the estimated true score equating used here.
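
The lower bound on estimated true scores can be illustrated with the three-parameter logistic model. The sketch below is not the LOGIST procedure used in the study; the item parameters are hypothetical, and the scaling constant D = 1.7 is an assumption.

```python
import numpy as np

def true_score_3pl(theta, a, b, c, D=1.7):
    """Estimated true score at ability theta: the sum over items of the
    three-parameter logistic probabilities of a correct response."""
    a, b, c = (np.asarray(v, dtype=float) for v in (a, b, c))
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    return p.sum()

# Hypothetical parameters for a five-item test.
a = [1.0, 0.8, 1.2, 0.9, 1.1]        # discrimination
b = [-1.0, -0.5, 0.0, 0.5, 1.0]      # difficulty
c = [0.20, 0.25, 0.20, 0.15, 0.20]   # lower asymptote

for theta in (-6.0, -3.0, 0.0, 3.0, 6.0):
    print(theta, round(true_score_3pl(theta, a, b, c), 3))
print("pseudo-chance level:", sum(c))
```

However low the ability value, the true score stays above the sum of the lower asymptotes, while raw scores can fall anywhere from zero to the number of items. Raw scores below the pseudo-chance level therefore have no true score counterpart, which is the condensing described above.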

Any differences among the one-parameter (Rasch), equipercentile,
and linear methods which might have existed were not detected by the
cross-validation procedures. The use of larger cross-validation
groups would be expected to lead to discovery of these types of
differences, if they did exist.



Conclusions 




The following procedures were found to provide useful information
in a comparison of equating methods and should be considered for use in
future studies.



1. Tetrachoric factor analyses to assess the degree of test
   unidimensionality.

2. Examination of graphs of equating curves to detect idiosyncratic
   results.

3. Examination of item parameter estimates for extreme estimates.

4. Comparison of the equating sample means and variances of anchor
   and converted equating form scores.

5. Completion of a cross-validation study including the calculation
   of bias, imprecision, and percentile comparison indices.

The cross-validation analyses require a representative group of
examinees who took both forms but were not included in the equating. The
percentile comparison index can still be used when randomly equivalent
groups of cross-validation examinees take the forms.

Larger equating sample sizes and/or a modification of the estimation
procedure would be necessary before the three-parameter method could
be suggested for equating. The other three equating methods produced
similar results. The one-parameter (Rasch) equating method can be expected
to produce results which are as stable as those for linear and
equipercentile methods for equating achievement tests which are similar to
the GED, with sample sizes around 200 and when both forms, of similar
difficulty, have been administered to the same examinees. The
three-parameter equating method is much less stable in this situation.

Additionally, investigation of the possibility of [...]
with the three-parameter method is warranted.

Table 1

First 12 Eigenvalues and Percentage of Variance Accounted
for in Tetrachoric Factor Analysis of Anchor Forms

                                        GED Test

Factor   Writing Skills    Social Studies    Science           Reading Skills    Mathematics
         Eigen-  %         Eigen-  %         Eigen-  %         Eigen-  %         Eigen-  %
         value   Variance  value   Variance  value   Variance  value   Variance  value   Variance

  1      20.79   25.98     18.66   31.10     17.09   28.48     14.56   36.40     15.07   30.14
  2       2.80    3.50      1.83    3.05      2.30    3.83      1.57    3.92      3.22    6.40
  3       2.07    2.58      1.39    2.32      1.87    3.17      1.30    3.25      1.48    2.96
  4       2.00    2.50      1.34    2.23     [...]    2.43      1.16    2.90      1.40    2.80
  5       1.74    2.18      1.25    2.08      1.39    2.32      1.05    2.63      1.21    2.42
  6      [...]    1.79      1.19    1.98      1.28    2.13      1.01    2.52      1.16    2.32
  7       1.38    1.72      1.16    1.93      1.22    2.03      1.00    2.50      1.11    2.22
  8       1.34    1.68      1.08    1.80      1.14    1.90      0.98    2.45      1.05    2.10
  9       1.29    1.61      1.07    1.78      1.12    1.87      0.93    2.32      1.04    2.08
 10       1.26    1.58      1.07    1.78      1.08    1.80      0.87    2.18      1.02    2.04
 11       1.18    1.48      1.04    1.73      1.05    1.75      0.87    2.18      0.98    1.96
 12       1.16    1.45      0.98    1.63      1.03    1.72      0.84    2.10      0.95    1.90




Table 2

Standard Deviation of Item Response Theory Difficulty
and Discrimination Parameter Estimates

                                                 Parameter Estimate

                            Number    One Parameter  Three Parameter  Three Parameter
Test            Form*       of Items  Difficulty     Difficulty       Discrimination

Writing Skills  Anchor         80     0.55           0.95             0.27
                Equating      880     0.52           2.99             0.32
Social Studies  Anchor         60     0.50           0.90             0.31
                Equating      660     0.53           6.42             0.31
Science         Anchor         60     0.62           1.38             0.42
                Equating      660     0.53           7.98             0.30
Reading Skills  Anchor         40     0.61           0.91             0.38
                Equating      440     0.57           1.18             0.?9
Mathematics     Anchor         40     0.81           1.28             0.45
                Equating      440     0.78           1.61             0.36

*Form "Equating" refers to equating forms one through eleven taken together.



Table 3

Mean Cross-Validation Indices and Tukey Critical Differences

                                                       GED Test

                                        Writing  Social            Reading
Index        Method                     Skills   Studies  Science  Skills   Mathematics

Bias         Equipercentile              3.14     1.27     5.02     0.63    [...]
             Linear                      2.98     1.30     4.82     0.61     1.11
             One Parameter               3.24     1.28     4.38     0.63     1.13
             Three Parameter             3.85     1.70     5.28     1.04     1.55
             Tukey Critical Difference   1.05     0.51*    2.22     0.32*    0.70

Imprecision  Equipercentile             63.02    48.85    48.10    22.36    28.76
             Linear                     61.81    47.93    47.12    21.93    28.77
             One Parameter              66.25    50.23    49.75    22.26    31.85
             Three Parameter            73.47    60.41    57.21    28.62    36.74
             Tukey Critical Difference   5.60*    8.21*    5.24*    3.75*    3.79*

Percentile   Equipercentile             17.07    15.05    13.99     7.69     7.99
Comparison   Linear                     15.81    14.96    12.35     7.26     7.34
             One Parameter              16.31    14.67    1?.02     7.18     8.34
             Three Parameter            20.20    20.72    15.34    10.15    11.32
             Tukey Critical Difference   4.51*    5.42*    4.53     2.95*    2.96*

*The equating methods main effect surpassed the [...] level of significance
in the analysis of variance.

Figure 1. Equating relationships between the anchor form and equating form
one of the GED Reading Skills test for four equating methods.